Preface


This  topic  was  selected  because  of  a continuing  need  in  the 
R&D  community  to  make  an  intelligibility  evaluation  of  experimental 
and  prototype  voice  communications  systems.  Because  the  Air  Force  does 
not  have  a more  automated  way  to  test  intelligibility  that  produces 
adequate  accuracy  and  is  relatively  easy  to  use,  they  are  still  using 
human  listener  panels  to  make  these  determinations.  I have  the 
feeling  that  "there  must  be  a better  way"  to  make  these  tests. 

It seems reasonable that if present state-of-the-art digital computer
techniques can synthesize speech, it should be possible to
determine the intelligibility of speech using computer processing.
Hopefully the approach used in this thesis will provide at least a
basis for development of a computerized method for measuring intelligibility
that will prove to be sufficiently accurate and simple to replace
the human listener method. The ultimate development of this type of
technique would provide a device which could be hooked to the communications
system under test and have a meter which would indicate the
intelligibility of the system on a real-time basis.

I am  indebted  to  Major  Joe  Carl,  my  advisor,  for  his  guidance, 
suggestions,  advice,  and  encouragement  during  the  preparation  of  this 
thesis.  I would  like  to  express  my  appreciation  to  Captain  Mazzie 
and  Mr.  William  Hall,  Jr.  of  the  Analog/Hybrid  Systems  Branch  of  the 
ASD  Computer  Center  for  the  many  hours  they  spent  on  the  preliminary 
processing  of  the  analog  speech  data.  My  thanks  to  Dr.  Oestreicher  and 
Richard  McKinley  of  the  Aerospace  Medical  Research  Laboratory  for 
allowing  me  to  use  their  anechoic  chamber  to  prepare  the  voice  data  tapes 
and their computer terminal to develop the computer algorithms and process
the data. I would also like to express my appreciation to Captain John
Bauer  for  providing  all  the  subjective  listener  data  used  as  a basis 
for  comparison  in  this  thesis. 


Wayne R. Beeson

Abstract

A method  of  predicting  speech  intelligibility  using  computer 
algorithms  is  presented.  Diagnostic  Rhyme  test  number  four  was 
used  to  measure  speech  intelligibility  using  a subjective  listener 
test  and  these  results  were  used  as  a basis  for  comparison  with  the 
intelligibility predictions made by the computer algorithm. An audio
recording of a speaker reading the Diagnostic Rhyme test was made.

This  recording  was  run  through  a General  Electric  radio  system  and 
varying  amounts  of  noise  were  added.  The  output  of  the  radio  system 
was  recorded,  providing  a copy  of  the  input  word  corrupted  by  both 
additive  noise  and  radio  system  distortion  effects.  Both  the  input 
recording  and  the  noisy  output  recording  were  digitized  by  sampling 
the analog waveforms at a 10 kilohertz rate. These digital samples
were  converted  to  a frequency  format  by  windowing  the  time  samples 
with  a rectangular  window  128  time  samples  in  length  and  processing 
them using Fast Fourier transform techniques. This procedure simulated
running the analog speech signal through a bank of contiguous
narrow bandpass filters covering the range of 0 to 5 KHz, with center
frequencies 78 Hz apart. The output of this process was a matrix
array, corresponding to each word from the tape, of amplitude values
200  time  windows  long  and  divided  into  64  frequency  bands.  These  64 
frequency  bands  were  then  combined  into  1/3  octave  groups  to  model 
the  frequency  sensitivity  of  the  average  human  ear,  which  reduced  the 
matrix  array  to  16  frequency  bands.  This  processing  of  the  analog 
signal  was  used  to  model  the  preprocessing  which  occurs  in  the  human 
ear.  A comparison  between  each  word  from  the  input  tape  and  the 
noisy  output  tape  was  then  made  using  a weighted  mean  squared  error 
AN ALGORITHM FOR DETERMINING
SPEECH INTELLIGIBILITY

I. Introduction


This work is in response to a need identified by Air Force Communications
Service (AFCS). There have been numerous studies in the area of
machine prediction of speech intelligibility; however, the Air Force
planners who requested this work are still using human listeners to
determine intelligibility. They believe either that available computer
methods do not produce sufficiently accurate results or that the computer
schemes are too complex and difficult to apply to their specific
problems. The intent of this work is to take applicable techniques from
work that has already been done and combine them to develop a simplified,
accurate method to evaluate voice intelligibility with a computer.


Background 

The oldest method for determining the intelligibility of speech is
a subjective method that involves trained speakers and listener panels
that directly score the percentage of speech that is intelligible. This
method is still considered the most reliable way to measure intelligibility
because it produces repeatable results. The disadvantages of the
subjective method are the considerable cost, large number of man-hours,
and specialized facilities and equipment required.

An  early  attempt  to  simplify  the  procedure  for  determining  the 
intelligibility  of  speech  involved  calculation  of  the  mean  squared  error 
(MSE)  between  an  audio  waveform  and  the  same  audio  signal  corrupted  by 
noise. This process uses the procedure given by Equation 1.

$$\mathrm{MSE} = \frac{1}{2T}\int_{-T}^{T} \left| x(t) - y(t) \right|^{2}\,dt \qquad (1)$$

Such an approach does not yield acceptable estimates of speech intelligibility.
This failure is attributed to the fact that vowels contain
more power than consonants in speech, but consonants are more important
in determining intelligibility than vowels (Ref 15:277). This method of
intelligibility measurement is no longer in use due to these shortcomings.
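The shortcoming described above can be illustrated with a discrete-time analogue of Equation 1. This is only a sketch: the function name `waveform_mse` and the synthetic "vowel-like" and "consonant-like" segments are assumptions made for the example and do not come from the original work.

```python
import numpy as np

def waveform_mse(x, y):
    """Discrete-time analogue of Equation 1: mean of |x - y|^2 over the record."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("signals must have the same length")
    return float(np.mean(np.abs(x - y) ** 2))

# Hypothetical test signal: a strong low-frequency "vowel-like" segment
# followed by a weak high-frequency "consonant-like" segment.
t = np.arange(1000) / 10_000.0
vowel = 1.0 * np.sin(2 * np.pi * 200 * t)
consonant = 0.05 * np.sin(2 * np.pi * 3000 * t)
clean = np.concatenate([vowel, consonant])

# Case A: the consonant is wiped out completely.
lost_consonant = np.concatenate([vowel, np.zeros_like(consonant)])
# Case B: the vowel amplitude is off by only 10 percent.
scaled_vowel = np.concatenate([0.9 * vowel, consonant])

mse_lost = waveform_mse(clean, lost_consonant)
mse_scaled = waveform_mse(clean, scaled_vowel)
```

In this constructed case, losing the consonant entirely produces a smaller error than a small amplitude error in the vowel, which is the kind of behavior that makes an unweighted waveform MSE a poor intelligibility predictor.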

One of the currently popular methods of automating intelligibility
prediction is use of the Articulation Index (AI). One method of calculating
the AI is by transforming the speech signal into an electrical
signal and then passing it through a set of contiguous bandpass filters
each 1/3 octave wide. The voltage output of each of these filters is
used to calculate a root mean square (RMS) voltage as shown by Equation 2.

$$\mathrm{RMS} = \sqrt{\frac{1}{2T}\int_{-T}^{T} x^{2}(t)\,dt} \qquad (2)$$

The noise that is affecting the system is passed through this same set
of filters and a root mean square noise voltage in each filter passband
is calculated. The value of the noise RMS voltage is subtracted from
the speech RMS voltage for each filter, with both expressed in decibels.
If this difference is 30 or more decibels, it is assigned a value of 30.
If the difference falls in the range of 0 to 30 decibels, the actual
decibel value is assigned.
If the difference is 0 or a negative value, it is assigned a value of 0.
These values for each filter are then multiplied by weighting factors
for each of the different frequency bands. These products are then
added together and their sum is the AI (Ref 1:6-15). This process is
illustrated by the block diagram in Figure 1.
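The clamping and weighting steps just described can be sketched as follows. This is a minimal illustration under stated assumptions: the band levels are compared in decibels using 20 log10 of the RMS voltage ratio, and the equal weights used in the example are hypothetical placeholders, since the actual weighting factors are given in Ref 1 and are not reproduced here.

```python
import numpy as np

def articulation_index(speech_rms, noise_rms, weights):
    """Sum of weighted, clamped band signal-to-noise levels (in dB)."""
    speech_rms = np.asarray(speech_rms, dtype=float)
    noise_rms = np.asarray(noise_rms, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Speech level minus noise level in each band, in decibels.
    diff_db = 20.0 * np.log10(speech_rms / noise_rms)
    # 30 dB or more counts as 30; 0 dB or less counts as 0.
    clamped = np.clip(diff_db, 0.0, 30.0)
    return float(np.sum(weights * clamped))

# Hypothetical three-band example with equal weights chosen so that
# the index runs from 0 (all bands masked) to 1 (all bands at 30 dB).
speech = [1.0, 1.0, 1.0]
noise = [0.01, 0.1, 2.0]          # band differences: 40 dB, 20 dB, about -6 dB
weights = np.full(3, 1.0 / (30.0 * 3))
ai = articulation_index(speech, noise, weights)
```

With these assumed numbers the three bands contribute 30, 20, and 0 after clamping, so the index is 50/90.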
The AI can be calculated by programming the previous procedure into
a computer  routine.  These  programs  are  in  common  use  today  and  are  an 
acceptable  predictor  of  intelligibility  as  long  as  the  noise  present  is 
additive  white  Gaussian  noise.  Colored  noise  and  multiplicative  noise 
require  a complete  recalibration  of  the  AI  system  to  give  good  results 
(Ref  7:2).  This  illustrates  that  the  type  of  noise  present  must  be 
known  exactly  and  corrected  for  to  maintain  the  accuracy  of  this  index. 
When  this  method  of  intelligibility  prediction  is  applied  to  a digital 
voice  system  or  a system  with  quantization  noise  present,  there  is  no 
correction or recalibration that will provide acceptable intelligibility
estimates  (Ref  7:2-3). 

Figure 1. Process Used to Calculate Articulation Index (AI)

A recent development in the field of automated voice intelligibility
prediction is the use of linear predictive coding (LPC). LPC derives
its name from the predictive process it is based on, which states: given
P samples of a speech signal, the next sample can be predicted approximately
by a linear function of the P known samples (Ref 6:A-3). LPC
models the vocal tract as an all-pole digital filter and estimates the
filter parameters (predictor coefficients) using the time domain speech
waveform. This model assumes the vocal tract
to be a time-varying filter with parameters changing slowly enough that
they can be considered fixed over a specified time interval. It accounts
for the glottal volume flow and radiation of sound from the mouth in
addition to vocal tract sounds (Ref 7:3-4). The most popular way to
estimate linear prediction coefficients (a_i) is the autocorrelation
method. This method involves time sampling an analog speech signal and
windowing these time samples, usually with a Hamming window 256 time
samples long. These windowed speech samples are used to calculate a
linear prediction of residual energy both for an undistorted speech
signal and for the same speech signal after it has been corrupted by
additive noise. These residual energy terms are then compared with the
actual energy terms and a distance measure is derived from these comparisons.
The calculation of these residual energy terms and their
comparison is a long and involved mathematical process presented in
detail by Hartmann (Ref 6:24-33).
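For orientation, the autocorrelation method of estimating predictor coefficients can be sketched with the standard Levinson-Durbin recursion. This is not Hartmann's distance measure, which is not reproduced here; it only shows how coefficients and a residual energy might be obtained from one Hamming-windowed frame. The function name `lpc_autocorrelation` and the predictor order of 10 are assumptions made for the example.

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Estimate predictor coefficients a[1..order] and the residual energy.

    The prediction convention is s[n] ~ sum_k a[k] * s[n - k].
    """
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))
    n = len(windowed)
    # Autocorrelation of the windowed frame at lags 0..order.
    r = np.array([np.dot(windowed[: n - k], windowed[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)      # a[0] is unused
    err = r[0]                   # residual energy of the order-0 predictor
    for i in range(1, order + 1):
        # Reflection coefficient for stage i of the Levinson-Durbin recursion.
        k_i = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        new_a = a.copy()
        new_a[i] = k_i
        new_a[1:i] = a[1:i] - k_i * a[i - 1:0:-1]
        a = new_a
        err *= 1.0 - k_i ** 2
    return a[1:], err

# Hypothetical usage on one 256-sample frame of synthetic data.
rng = np.random.default_rng(0)
frame = rng.standard_normal(256)
coeffs, residual_energy = lpc_autocorrelation(frame, order=10)
```

Even in this compact form, each frame requires an autocorrelation and a recursion over the predictor order, which gives some sense of the computational load the text attributes to LPC.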

The use of LPC techniques overcomes the sensitivity to the type of
noise present that limits AI. LPC intelligibility
predictions  give  a good  correlation  with  listener  scores  when  averaged 
over  50  or  more  words  (Ref  6:13-19).  The  disadvantage  of  LPC  is  that 
it  involves  a large  number  of  computer  computations  and  consequently 
consumes  a great  deal  of  computer  time  to  analyze  a small  amount  of 
speech.  A second  disadvantage  of  this  method  is  that  it  requires  very 
close  synchronization  of  the  words  on  the  undistorted  tape  with  the 
same  words  on  the  tape  containing  additive  noise.  This  requirement  for 
exacting  synchronization  makes  it  necessary  to  employ  very  specialized 
taping  equipment  to  make  this  process  work  (Ref  6:  18,20). 

Approach 

The approach to computer evaluation of speech intelligibility used
in this thesis combines some features of the Articulation Index calculation,
the linear predictive coding method, and the mean squared error
calculation.

The human auditory system performs multiple stages of preprocessing
on an audio signal before it reaches the brain. Therefore, it seems
reasonable to assume that if the processes occurring in the ear and
brain can be modeled, it will be possible to make the same type of intelligibility
determination as the human. The first step in doing this is to
model the preprocessing which occurs in the ear.

To  model  the  action  of  the  ear  drum  in  converting  sound  pressure 
variations  to  vibration  and  the  middle  ear  which  transmits  these  as  a 
varying  mechanical  vibration  to  the  inner  ear,  a tape  recorder  was  used. 
The  recorder  converts  sound  pressure  variations  into  an  appropriate, 
continuously  varying  analog  signal. 

In  the  inner  ear  (cochlea)  the  mechanical  vibration  variations 
undergo  the  next  stage  of  processing.  This  process  is  quite  complex, 
but  it  appears  to  involve  excitation  of  the  neurons  at  the  base  of  the 
hair  cells  inside  the  cochlea  due  to  movement  of  the  hair  cells.  This 
movement  is  a result  of  the  mechanical  vibration  coming  from  the  middle 
ear  causing  the  fluid  in  the  cochlea  to  move  the  hair  cells.  Since  the 
cochlea  is  apparently  a frequency  analyzing  device,  a model  for  the 
inner  ear  should  present  the  signal  in  a frequency  format  (Ref  10).  The 
model for the cochlea used in this thesis consists of sampling the
analog waveform from the tape at the Nyquist rate and running these
samples through a bank of contiguous bandpass filters. The outputs of
these filters are grouped into 1/3 octave bands to simulate the sensitivity
of the ear. This changes the analog waveform into a frequency
format.

Kabrisky proposed that the cortex of the brain is capable of performing
a two-dimensional cross-correlation of a test image with a
stored pattern (Ref 9:47-57). This theory about the visual system was
extended to the auditory system by Dailey and Sutton (Ref 3). Assuming
the brain performs this two-dimensional cross-correlation between the
input signal from the cochlea and phonemes stored in memory and picks
the largest correlation value to indicate what phoneme was heard, this
process must be modeled. Since the undistorted phoneme stored in memory
would have to be in the same format as the incoming signal from the
cochlea, a possible model of this process would be to compare the undistorted
input word, preprocessed by the models of the ear, with the
same word imbedded in noise and run through the same processing. The
correlation process will only determine a measure of the difference
between a word and its corrupted form, so it appears that a mean squared
error calculation can simulate this correlation satisfactorily. This
mean squared error will be weighted because of the grouping of filter
outputs occurring in the cochlea model.

Objective 

The  object  of  this  research  is  to  explore  the  possibility  of 
developing  a computer  program  that  will  give  a reasonably  accurate 
prediction of the intelligibility of speech. This system, if successful,
will be used by people with varying degrees of computer support
available to them. For this reason the main idea was to keep the procedure
simple and automate it as much as possible. Another consideration
was to minimize the computer memory and central processor time
required  for  the  processing  so  people  can  get  the  program  through  a 
busy,  time  shared  computer  in  a reasonable  amount  of  time.  The  final 
goal  was  to  eliminate  the  need  for  any  elaborate  or  unique  equipment  to 
make  or  process  the  audio  tape. 
Scope

The scope of this project is limited to developing a computer program
that will model the actions of the ear on sound waves and apply one
possible  comparison  scheme  to  model  the  action  of  the  brain  on  the 
processed  audio  signal.  Section  II  outlines  the  procedures  used  to 
make  audio  tapes  that  are  used  for  both  human  listener  and  computer 
intelligibility testing. Section III details the initial computer processing
of these audio tapes to sample them at the Nyquist rate and
perform a Fast Fourier Transform (FFT) on these time samples. Section
IV  describes  how  the  original  matrix  array  of  amplitude  values,  produced 
by  the  FFT  process,  was  compressed  so  it  would  closely  approximate  the 
way  the  ear  processes  sound  data.  A method  of  representing  each  word  by 
a speech  spectrogram  and  using  this  to  locate  the  word  exactly  in  a 
group  of  time  samples  of  the  input  wave  form  is  discussed.  Section  V 
deals  with  the  cross-correlation  method  used  to  locate  a word  which  is 
imbedded  in  noise.  It  evaluates  how  much  the  word  has  been  distorted 
by  the  additive  noise  using  a weighted  mean  squared  error  comparison 
between  the  word  before  the  noise  is  added  and  the  same  word  plus 
additive noise. The last two sections show the results of this procedure
and make some recommendations for further work in this area.
II. Data Acquisition

The  data  used  in  these  tests  for  intelligibility  was  the  Diagnostic 
Rhyme  Test  Number  IV  (DRT-IV).  DRT-IV  is  composed  of  58  rhyming  word 
pairs with each word pair designed to test for one of six speech attributes.
There are eight rhyming pairs in the list which check for each
attribute. The six attributes tested for are voicing, nasality, sustention,
sibilation, graveness, and compactness (Ref 15:15-21). The words
that  test  for  these  attributes  are  separated  by  ten  pairs  of  filler 
words.  The  DRT-IV  used  in  these  tests  is  shown  in  Table  I and  the  words 
which  test  for  each  of  the  speech  attributes  are  identified. 

Acquisition  Procedure 

The  data  acquisition  consisted  of  a male  speaker  reading  one  word 
of  each  rhyming  word  pair  from  DRT-IV  and  recording  these  words  on  one 
track  of  a stereo  tape  recorder.  The  other  track  of  the  stereo  tape 
was used to record one kilohertz tones which are used for timing references
in subsequent processing of the audio tape. The recorder used
was a reel-to-reel Sony Model 850 which gave a reasonably high quality
of audio reproduction. Recording of the DRT-IV words on tape was used
to model the action of the outer ear which converts the pressure variations
of sound into an analog signal format.

In recording the test audio tapes, two different male speakers were
used to reduce the possible effect of a speaker's regional accent
on the intelligibility results. The first speaker had a southern
accent  (Arkansas)  and  the  second  had  very  little  regional  accent 
(Minnesota).  Four  master  tapes  were  made  of  DRT-IV,  two  by  the  first 
speaker and two by the second speaker.

Table I

Diagnostic Rhyme Test

DRT IV-(2)

PEST - TEST       -(filler)-        FAN - PAN
VAULT - FAULT     -(voicing)-       CHOCK - JOCK
DUES - NEWS       -(nasality)-      NOTE - DOTE
VEE - BEE         -(sustention)-    TICK - THICK
THANK - SANK      -(sibilation)-    CARE - CHAIR
ROD - WAD         -(graveness)-     DONG - BONG
SO - SHOW         -(compactness)-   YOU - RUE
LID - RID         -(filler)-        REEK - LEAK
DENSE - TENSE     -(voicing)-       GAFF - CALF
BOSS - MOSS       -(nasality)-      BOMB - MOM
FOO - POOH        -(sustention)-    DOUGH - THOUGH
ZEE - THEE        -(sibilation)-    GILT - JILT
FAD - THAD        -(graveness)-     PENT - TENT
HOP - FOP         -(compactness)-   YAWL - WALL
ROW - LOW         -(filler)-        LOOT - ROOT
GIN - CHIN        -(voicing)-       VEAL - FEEL
BEND - MEND       -(nasality)-      NAB - DAB
CHAW - SHAW       -(sustention)-    BON - VON
JUICE - GOOSE     -(sibilation)-    SOLE - THOLE
PEAK - TEAK       -(graveness)-     THIN - FIN
BAT - GAT         -(compactness)-   KEG - PEG
ROCK - LOCK       -(filler)-        LONG - WRONG
GOAT - COAT       -(voicing)-       TUNE - DUNE
MIT - BIT         -(nasality)-      MEAT - BEAT
THEN - DEN        -(sustention)-    SHAD - CHAD
GAUZE - JAWS      -(sibilation)-    GOT - JOT
NOON - MOON       -(graveness)-     DOLE - BOWL
KEY - TEA         -(compactness)-   DILL - GILL
RAMP - LAMP       -(filler)-        LEND - REND

These four master tapes were
played into the input of a General Electric (GE) preliminary development
model,  spread-spectrum  radio  transmitter.  The  modulated  radio  signal 
was  transmitted  to  a hybrid  summer  where  it  was  mixed  with  additive 
noise  from  a pseudo-random,  matched  spectrum  noise  generator.  The  output 
of  the  hybrid  summer  was  then  fed  into  the  companion  receiver  of  the  GE 
transmitter  and  the  output  audio  tapes  were  recorded  at  the  audio  output 
stage of the receiver. The noise generator was used to simulate intentional
jamming of the radio link. Output tapes were made at eight
different  signal  to  jammer  (S/J)  levels.  The  four  input  tapes  were 
each  used  twice  as  the  input  when  making  the  different  S/J  level  output 
tapes.  The  eighth  output  tape  had  the  lowest  S/J  ratio  and  the  signal 
was  too  low  to  be  usable  for  testing,  so  this  tape  was  discarded.  The 
remaining  tapes  have  S/J  levels  numbered  from  one  through  seven.  The 
highest  S/J  level  is  number  one  and  the  S/J  ratio  decreases  as  the 
number  increases  with  tape  number  seven  representing  the  lowest  S/J 
ratio. The actual S/J levels associated with these numbers are classified
Secret. If the actual S/J levels are desired, this information
is  given  in  the  classified  portion  of  Captain  Bauer's  thesis  (Ref  2). 

All  recordings  of  the  output  of  this  system  were  made  on  new 
Scotch,  1/4  inch  tape  using  a Sony  Model  850  recorder.  These  tapes  were 
recorded at a tape speed of 7 1/2 inches per second.

The GE communications system used in this test was being evaluated
by a fellow student for intelligibility using human listener tests
(Ref 2). This provided a convenient way to obtain human listener intelligibility
data to compare to the intelligibility predictions made by
the computer method presented in this thesis.
III. Analog to Digital Conversion


The initial processing of the four master DRT-IV input tapes and
the seven output tapes with different S/J additive noise ratios was
done  by  the  Analog/Hybrid  Systems  Branch  of  the  Aeronautical  Systems 
Division  (ASD)  Computer  Center. 

When  the  audio  tapes  were  made,  the  words  from  DRT-IV  were  recorded 
on  the  right  channel  of  a stereo  tape  recorder  and  a one  kilohertz  (KHz) 
tone,  1/2  second  long,  was  recorded  on  the  left  channel.  These  one  KHz 
tones  were  machine  generated  with  the  apparatus  shown  in  Figure  2. 

Every time the tone sounded, the next word from the DRT-IV list was
recorded within 2 1/2 seconds after the tone. The tones were spaced
seven seconds apart so there was at least 4 1/2 seconds after each word
before  the  next  tone.  The  Analog/Hybrid  Branch  played  each  tape  back 
and  low  pass  filtered  it  to  2.5  KHz  and  fed  this  into  the  Comcor 
Ci-5000/6  analog  computer.  The  computer  sampled  the  input  at  the 
Nyquist  rate  of  5 KHz.  Using  2.5  KHz  as  the  upper  cutoff  frequency  was 
necessary  because  the  bandwidth  of  the  amplifiers  in  the  analog  computer 
was limited to this value. Since it was desired to analyze the speech
input over a range of zero to 5 KHz, it was necessary to analyze the
tape by playing it back at a tape speed of 3 3/4 inches per second,
half the recording speed, to give the effect of low pass filtering
to 5 KHz and sampling at a 10 KHz rate at the original recording speed.
This  makes  it  possible  to  evaluate  the  speech  signal  over  the  desired 
frequency  range  in  spite  of  the  limitations  imposed  by  the  computer's 
amplifier  bandwidth. 
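The effect of half-speed playback can be checked with a short calculation. The symbols below are introduced only for this illustration, and the 7 1/2 inches per second recording speed is inferred from the statement that 3 3/4 inches per second is half the recording speed.

```latex
\frac{v_{\mathrm{rec}}}{v_{\mathrm{play}}} = \frac{7.5\ \mathrm{in/s}}{3.75\ \mathrm{in/s}} = 2,
\qquad
f_{c,\mathrm{eff}} = 2 \times 2.5\ \mathrm{kHz} = 5\ \mathrm{kHz},
\qquad
f_{s,\mathrm{eff}} = 2 \times 5\ \mathrm{kHz} = 10\ \mathrm{kHz}
```

Every frequency on the tape is halved during playback, so a 2.5 KHz filter cutoff and a 5 KHz sample rate applied to the slowed signal correspond to 5 KHz and 10 KHz at the original speed.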

The input speech signal was amplified to approximately 100 volts
prior to processing to provide a sufficient voltage swing to utilize
the accuracy possible with the 11 bit analog to digital converters in
the computer. These 11 bit numbers are a binary representation of a 4
digit decimal number. These numbers give the voltage level, between 0
and 100 volts, of the analog waveform each time the waveform is sampled
by the digital computer. The 1 KHz tones recorded on the left channel
of the audio tapes were used to trigger the sampling equipment in the
computer. When the tone occurs, the computer starts sampling the input
audio waveform and continues sampling for 2 1/2 seconds, then stops until
the next tone occurs. The word from DRT-IV is contained somewhere in
this 2 1/2 second sampling interval and will be located exactly by subsequent
processing.

Frequency  Analysis 

The proposed model for the inner ear requires that the digitized
analog speech data be represented in the equivalent frequency domain.
Fast Fourier Transform (FFT) techniques were used to convert the
digitized data into a frequency representation format (Ref 5:41-52).
The actual data conversion involves grouping the digitized time samples
into groups of equal length (windowing) and applying FFT techniques to
these window groupings to simulate a bank of narrow bandpass filters.
The size and shape of the window is based on the desire to have a wide
band analysis while retaining reasonable time resolution. The methods
for doing this are discussed in detail by Neyman (Ref 12:17-18). The
window used is rectangular and 128 time samples long. The digitized
data was processed using this window size by an Analog/Hybrid Branch
program called AMPSPC. This program gave 64 discrete amplitude values,
each corresponding to a 78.125 Hz frequency segment located in the range
of 0 to 5 KHz and covering a time window of 12.8 milliseconds
(128 time samples at 10 KHz sample rate). Processing 200 consecutive
windows, which span 2.56 seconds or approximately the 2 1/2 second
sampling interval, produced a 64 x 200
matrix array of amplitude values. Each of these matrix arrays contains
one word from the DRT-IV audio tape. The matrix arrays were written on
a nine track ASD Computer Center library tape (L-tape) and stored for
later processing by the CDC-6600 computer.
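The windowing and transform step can be sketched as below. This is not the AMPSPC program, whose internals and amplitude scaling are not described in the text; it is a minimal sketch that assumes the 64 bands are the FFT bins centered at 78.125 Hz through 5000 Hz, with the zero-frequency bin discarded. The function name `amplitude_matrix` is hypothetical.

```python
import numpy as np

WINDOW = 128        # time samples per rectangular window (from the text)
N_WINDOWS = 200     # time windows per word record (from the text)
FS = 10_000.0       # effective sample rate in Hz (from the text)

def amplitude_matrix(samples):
    """Return a 64 x 200 array of FFT magnitudes for one word record."""
    samples = np.asarray(samples, dtype=float)
    needed = WINDOW * N_WINDOWS
    if samples.size < needed:
        raise ValueError(f"need at least {needed} samples, got {samples.size}")
    # Rectangular windowing is just a reshape into non-overlapping frames.
    frames = samples[:needed].reshape(N_WINDOWS, WINDOW)
    # rfft of 128 real samples gives 65 bins at 0, 78.125, ..., 5000 Hz.
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    # Assumption: drop the DC bin so that 64 bands remain.
    return spectrum[:, 1:].T

# Hypothetical usage: a tone centered on the tenth band (781.25 Hz).
t = np.arange(WINDOW * N_WINDOWS) / FS
tone = np.sin(2 * np.pi * 781.25 * t)
matrix = amplitude_matrix(tone)
```

The bin spacing follows directly from the numbers in the text: 10 000 Hz divided by 128 samples is 78.125 Hz.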
IV. Digital Signal Processing

Each word of the DRT-IV is now contained in a 64 x 200 matrix
array stored on a computer L-tape. Each element of this array represents
the signal amplitude to four decimal place accuracy. This signal
representation corresponds to an analog speech signal that has been
run through a bank of 64 bandpass filters with center frequencies
78.125 Hz apart and an upper cutoff of 5 KHz.

Data  Compression 

In  general  the  human  ear  is  not  a linear  receiving  device.  In 
order  to  model  the  nonlinear  frequency  response  of  the  ear  it  was 
necessary to restructure the digital data so it would approximate
the  ear's  unusual  sensitivity  to  frequency  change.  The  six  lowest 
frequency  bands  of  the  matrix  array  were  left  unchanged.  This  group 
has  center  frequencies  of  78.125,  156.250,  234.375,  312.500,  390.625, 
and  468.750  Hz.  All  higher  frequencies,  up  to  5 KHz,  are  grouped 
into approximately 1/3 octave ranges and the energy content of each
group is the sum of the individual array elements contained within
that group. This restructuring produced 16 frequency dependent amplitudes
from the original 64. The frequency groupings that produce
these  16  values  are  shown  in  Table  II. 

Adding the energy of each array element within a 1/3 octave group
compensated for the lower amplitude of sound harmonics produced by the
vocal cords at high frequencies. This eliminated the need to use the
standard preemphasis technique of increasing the signal magnitude by
six decibels per octave above 350 Hz (Ref 14:311).
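The compression from 64 bands to 16 can be sketched as follows. The group boundaries used here are an assumption: they are generated by geometric spacing of the remaining 58 bands into ten groups, which gives a ratio of about 1.27 between successive edges (close to the 1/3 octave ratio of 1.26). They are not taken from Table II, and the function names are hypothetical.

```python
import numpy as np

def band_edges(n_keep=6, n_bins=64, n_groups=10):
    """Assumed group boundaries: geometric spacing from band 6 to band 64."""
    k = np.arange(n_groups + 1)
    edges = n_keep * (n_bins / n_keep) ** (k / n_groups)
    return np.round(edges).astype(int)

def compress(matrix):
    """Reduce a 64 x N amplitude array to 16 x N.

    The six lowest bands are copied unchanged; each higher group is the
    sum of the array elements it contains, as described in the text.
    """
    matrix = np.asarray(matrix, dtype=float)
    if matrix.shape[0] != 64:
        raise ValueError("expected 64 frequency bands")
    edges = band_edges()
    low = matrix[:6]
    groups = [matrix[lo:hi].sum(axis=0) for lo, hi in zip(edges[:-1], edges[1:])]
    return np.vstack([low, np.array(groups)])

# Hypothetical usage on a constant array: group values equal band counts.
compressed = compress(np.ones((64, 200)))
```

Because each group is a sum rather than an average, the wider high-frequency groups receive larger values for the same per-band amplitude, which is the compensating effect the text describes.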
Table II

Speech Frequencies

[Surviving entries, in Hz: 2343.750, 2421.875, 2500.000 and 4843.750, 4921.875, 5000.000]


Gray  Scale  Spectrogram 

It  is  desirable  to  be  able  to  see  a spectrogram  of  the  word  when 
working  with  the  16  x 200  matrix  array  that  results  from  the  previous 
compression  procedure.  This  aids  in  quickly  locating  where  the  word 
occurs  in  the  200  time  windows  and  determining  the  length  of  the  word. 
A convenient method for creating a gray scale spectrogram using computer
overprint symbols was developed by Neyman and is used here
(Ref  12:22-24).  Table  III  shows  the  overprint  symbols  used  to  create 
the  spectrogram  and  Figure  3 shows  what  the  actual  spectrogram  of  a 
word  looks  like. 


Table  III 

Overprint  Symbols  for  Speech  Spectrograms 


Word  Location  Technique 

In order to use the matrix array of amplitude values for comparison
purposes it is necessary to locate the word in the 200 time windows and
save only the part of the array containing the word. This location
process was accomplished by thresholding each value in the matrix array
at 1.5 and saving the part of the array where the values exceeded this
level. A filtering process was included in the program to prevent a
noise spike occurring outside the word from being mistaken for
part of the word.
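One way the thresholding and spike rejection could work is sketched below. The text does not describe the filtering process, so the rule used here (keep the longest run of consecutive above-threshold columns and ignore runs shorter than `min_run`) is an assumption, as are the function name `locate_word` and the default `min_run` of 3.

```python
import numpy as np

def locate_word(array, threshold=1.5, min_run=3):
    """Return (start_column, n_columns) of the longest run of active columns.

    A column is active if any band in it exceeds the threshold. Runs shorter
    than min_run are treated as noise spikes and ignored. Returns (0, 0) if
    no run qualifies.
    """
    array = np.asarray(array, dtype=float)
    active = (array > threshold).any(axis=0)
    best_start, best_len = 0, 0
    start = None
    # A trailing False acts as a sentinel that closes a run ending at the edge.
    for col, flag in enumerate(np.append(active, False)):
        if flag and start is None:
            start = col
        elif not flag and start is not None:
            length = col - start
            if length >= min_run and length > best_len:
                best_start, best_len = start, length
            start = None
    return best_start, best_len

# Hypothetical usage: a 40-column "word" plus an isolated one-column spike.
demo = np.zeros((16, 200))
demo[3, 50:90] = 2.0
demo[5, 10] = 9.0
start, m = locate_word(demo)
word = demo[:, start:start + m]
```

A rule this simple would split a word that contains a silent gap, such as a stop consonant closure, so a practical version might also need to bridge short gaps inside the word.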


Master  Tape  Processing 

The  plan  was  to  sample  the  analog  speech  data,  convert  it  to  a 
frequency  format  representation,  compress  the  resulting  64  frequency 
divisions to 16, locate the word exactly in the 200 time windows, and
store  only  the  portion  of  the  200  column  array  where  the  word  occurs 
for  use  in  subsequent  processing.  These  steps  were  written  into  a 
computer  program  for  use  in  the  CDC-6600  computer. 

This  program  was  used  to  process  the  four  master  DRT-IV  tapes 
that  were  used  as  inputs  to  the  GE  radio  system.  Each  master  tape 
was  run  through  this  program  and  the  column  number  where  each  word 
started,  the  number  of  columns  (M)  occupied  by  the  word,  and  the  16  x M 
array  of  amplitude  values  containing  the  word  were  recorded  on  a second 
L-tape. 

An alternate method for locating the word in the 200 time windows
was tried, to establish a basis for comparing the effectiveness of the
previous procedure. The matrix array was first normalized using
Equation 3


where $a_{i,j}$ = normalized array element. The normalized array was then
thresholded at an appropriate level and the computer program predicted
where the word was located in the 200 column matrix. It was easy
to see where the word occurred within the 200 column matrix by looking
at the accompanying spectrogram, Figure 4, so this was used as a means
of evaluating the two computer techniques given above for locating the
word. This comparison showed that when the data in the array were
normalized prior to doing a computer search for the word, the computer
program frequently failed to correctly locate the word. When the computer
search for the word was done without normalizing the matrix array, it
found the word accurately every time. No explanation can be offered
for why normalizing the array data caused the word location program
to fail.
Figure 4. Digital Speech Spectrogram and Associated Matrix Array

V. Cross-Correlation and Mean Squared Error Calculation


The  seven  noisy  tapes  made  at  the  audio  output  of  the  GE  radio 
under  test  must  now  be  compared  with  the  master  input  tape  from  which 
each  was  made.  The  input  tape  is  compared  with  the  output  tape  and 
the  mean  squared  error  between  each  input  word  and  its  output  plus 
additive  noise  is  calculated. 

To locate where the word occurred on the noisy output tape, it was
necessary to perform a cross-correlation between the input word array
and the 16 x 200 output array containing the same word embedded in
noise. The length of the word and the part of the array containing
the word from each master tape were previously recorded on a computer
L-tape. This L-tape is read a word at a time and cross-correlated with
the corresponding 16 x 200 array on the L-tape containing the noisy
words. The length of the word is read first and that number subtracted
from 200 to find the number of cross-correlations that must be
performed. Next the arrays are read into the computer core memory and
a cross-correlation is performed with the first column of the word
from the master tape lined up with the first column of the 16 x 200
noisy array. After each cross-correlation value is determined, the
array containing the word from the master list is shifted one column
to the right with respect to the noisy array. When all the
cross-correlations have been performed for that word, the largest value
computed indicates the point where the input word and the noisy output
word were aligned. The equation used to compute each of these
cross-correlations (ρ) is


$$\rho(\tau) = \sum_{i=1}^{L} \sum_{j=1}^{16} A_{i,j}\, B_{i+\tau,j} \qquad (4)$$

where L = 200 - length of word read from master tape
      $A_{i,j}$ = element of word array from master tape
      $B_{i,j}$ = element of word array from noisy tape


When the maximum cross-correlation value has located the word (that
is, once the value of $\tau_0$ is known such that $\rho(\tau_0)$ is a
maximum) in the 16 x 200 array containing the additive noise, there is
enough information available to calculate the mean squared error (MSE)
between the two words using the formula

$$\mathrm{MSE} = \sum_{i=1}^{L} \sum_{j=1}^{16} (A_{i,j})^2 - 2\rho(\tau_0) + \sum_{i=\tau_0}^{L+\tau_0} \sum_{j=1}^{16} (B_{i,j})^2 \qquad (5)$$

To compute the MSE, the maximum cross-correlation value determined
in the previous operation is multiplied by two to form the middle term
of the MSE equation. Since the exact location of the word in the
16 x 200 noisy array is now known, the last term of the MSE equation can
be calculated using this information to square the elements of the
array containing the word and sum these squares. The L-tape containing
the array of the word from the master input list is used to calculate
the first term in the MSE equation by squaring each element of the
array and then summing the squares.
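
The alignment search and the three-term MSE calculation described above can be sketched as follows. This is a minimal Python illustration, not the original program. It assumes the sums run over the columns actually occupied by the master word, that every shift keeping the word inside the noisy array is tried, and that shifts are counted from zero. The function names are hypothetical. As in the formula above, the result is a sum of squared differences and is not divided by the number of elements.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # rows = frequency bands, columns = time windows


def cross_correlation(word: Matrix, noisy: Matrix, shift: int) -> float:
    """Sum of products of the word and the noisy array offset by `shift` columns."""
    m = len(word[0])
    return sum(word[j][i] * noisy[j][i + shift]
               for j in range(len(word))
               for i in range(m))


def align_and_mse(word: Matrix, noisy: Matrix) -> Tuple[int, float]:
    """Return (best_shift, mse), with mse assembled from three separate terms."""
    m, n = len(word[0]), len(noisy[0])
    # Assumption: every shift that keeps the word inside the noisy array is tried.
    values = [cross_correlation(word, noisy, s) for s in range(n - m + 1)]
    best = max(range(len(values)), key=lambda s: values[s])

    # First term: energy of the master word.
    energy_word = sum(a * a for row in word for a in row)
    # Last term: energy of the noisy array over the columns aligned with the word.
    energy_noisy = sum(row[i] ** 2 for row in noisy
                       for i in range(best, best + m))
    # Middle term: twice the maximum cross-correlation value.
    return best, energy_word - 2.0 * values[best] + energy_noisy
```

Assembling the error from these three terms gives the same number as summing the squared differences element by element over the aligned columns, which is why the peak correlation value can be reused instead of making a second pass over both arrays.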

The average mean squared error for the DRT-IV list recorded at each
of the seven different S/J ratios was determined by summing the MSE for
all 58 words in the list and dividing this sum by 58. This gives an
average MSE corresponding to each S/J level to be used in deriving an
estimate of the intelligibility at that S/J level.

The brain "expects" to see uncorrupted phoneme groups (words)
characterized in a certain way. Assuming that the models used here give a
reasonably accurate representation of the preprocessing which occurs
in the ear, it should now be possible to measure the difference
between what the brain expects to hear and what it actually hears.
It is conjectured that this difference is inversely related to the
intelligibility of what is heard. To model this process the MSE
calculation provides a measure for determining the difference between a
word and that same word after it has been corrupted by noise. It is
assumed this distance measure can now be related to intelligibility.




VI.  Results 




The first computer program discussed in section IV was designed
to locate each word in the 200 time windows. This program worked
perfectly as long as the tapes containing the words had a low noise
level compared to the signal amplitude of the word. It was also
necessary to amplify the peak levels within each word to at least 75
volts prior to digital sampling.

The  second  program,  discussed  in  section  V,  was  designed  to  first 
locate  the  word  in  the  200  time  windows  on  the  tape  with  additive 
noise  by  a cross-correlation  with  the  same  word  without  noise.  This 
cross-correlation  provided  a sharp  peak  with  an  amplitude  well  above 
the  other  values  to  indicate  when  the  two  words  were  aligned.  This 
distinct  peak  occurred  even  at  the  lowest  S/J  ratio.  This  can  be  seen 
by  looking  at  the  cross-correlation  values  for  a word  at  the  lowest 
S/J  level,  Figure  5. 

The second part of the program described in section V is used to
calculate the mean squared error between the input word and the same
word after it passes through the radio system and is corrupted by
noise. Figure 6 shows a plot of mean squared error values versus the
seven different S/J ratios. The increase in MSE is approximately linear
as the S/J ratio decreases.

Figure  7 shows  the  average  number  of  errors  made  by  the  10  people 
who  listened  to  the  noisy  tapes  at  each  S/J  ratio. 

Figure 6. Plot of Mean Squared Error Versus S/J

A scatter plot of human listener error scores versus mean squared
error values is shown in Figure 8. This plot displays the data used
to calculate Pearson's Correlation Coefficient. In this case the
coefficient was calculated to be 0.74. To find the percentage of the
variance in the listener errors accounted for by observing the mean
squared error values, under a Gaussian assumption (zero mean Gaussian),
the coefficient must be squared and multiplied by 100, which gives a
value of 55%.
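
The arithmetic behind the 55% figure can be checked with a short Python sketch. The `pearson_r` function below is the standard textbook form of the coefficient and is offered only as an illustration; the listener and MSE data themselves are not reproduced here, so only the reported value of 0.74 is used. Both function names are chosen for this example.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def variance_explained_percent(r: float) -> float:
    """Percentage of variance accounted for: the coefficient squared, times 100."""
    return 100.0 * r * r


# 0.74 squared is 0.5476, which the text rounds to 55%.
print(round(variance_explained_percent(0.74), 2))
```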
VII. Conclusions and Recommendations

The value of 0.74 which was calculated for Pearson's Correlation
Coefficient confirms that there is significant correlation between the
intelligibility scores of the human listeners and the MSE values
calculated by the computer program. This appears to support the ideas set
forth in this thesis as a reasonable approach to predicting voice
intelligibility. This single comparison is not enough to determine an exact
relationship between the mean squared error value and intelligibility,
but it seems to provide a starting place for further refinement of the
procedure.

The human listener scores used as a comparison were the average of
ten listeners. These results showed an unexpectedly high error rate at
the best S/J level, which decreased as the noise got worse for the first
three S/J levels, Figure 7. This trend disappears if a much larger
group of listener scores is averaged together, and the error rate then
increases monotonically as the S/J ratio decreases (Ref 2). Such
listener performance would have given closer agreement with the MSE
predictions. This suggests that a listener group considerably larger
than ten is required to get a reliable intelligibility figure.

One of the original goals of this thesis was to provide a means
of making computerized intelligibility measurements that do not require
any special or unique equipment. This goal was met, with the exception
of the processing described in section III done by the ASD Analog/Hybrid
Branch. A recommendation for further work in this area is to develop
a program for one of the more elaborate minicomputers in common use in
the Air Force that can perform these operations.




Further investigation in this area could focus on methods other than
MSE for comparing the master word and noisy word after they have been
preprocessed by the ear models. This may yield a better predictor
than MSE.
Appendix A


Sequence  Chart  for  Intelligibility  Prediction 
radio system distortion effects. Both the input recording and the noisy output
recording were digitized by sampling the analog waveforms at a 10 kilohertz rate.
These digital samples were converted to a frequency format by windowing the time
samples with a rectangular window 128 time samples in length and processing them
using Fast Fourier transform techniques. This procedure simulated running the
analog speech signal through a bank of contiguous narrow bandpass filters
covering the range of 0 to 5 kHz, with center frequencies 78 Hz apart. The
output of this process was a matrix array, corresponding to each word from the
tape, of amplitude values 200 time windows long and divided into 64 frequency
bands. These 64 frequency bands were then combined into 1/3 octave groups to
model the frequency sensitivity of the average human ear, which reduced the
matrix array to 16 frequency bands. This processing of the analog signal was
used to model the preprocessing which occurs in the human ear. A comparison
between each word from the input tape and the noisy output tape was then made
using a weighted mean squared error calculation. This comparison was
conjectured to provide a difference measure which is inversely related to
intelligibility. This comparison was used to represent how intelligible the input
received from the inner ear is to the brain.
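
The relationship between the sampling rate, the window length, and the 78 Hz band spacing can be illustrated with a short Python sketch. It is not the original processing code: it uses a direct discrete Fourier transform rather than a fast algorithm (the magnitudes are the same), and it assumes the 64 bands are the transform bins centered at 78.125 Hz through 5000 Hz. All names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence

SAMPLE_RATE_HZ = 10_000.0  # 10 kilohertz sampling rate
FRAME_LEN = 128            # rectangular window length in samples
N_BANDS = 64               # assumption: bins 1..64, i.e. 78.125 Hz to 5000 Hz


def band_center_hz(band: int) -> float:
    """Center frequency of band 1..64; adjacent bands are 10000/128 Hz apart."""
    return band * SAMPLE_RATE_HZ / FRAME_LEN


def frame_magnitudes(frame: Sequence[float]) -> List[float]:
    """Magnitude in each of the 64 bands for one rectangular-windowed frame."""
    if len(frame) != FRAME_LEN:
        raise ValueError("frame must contain exactly 128 samples")
    mags = []
    for k in range(1, N_BANDS + 1):
        # Direct DFT sum for bin k; an FFT would produce the same value faster.
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / FRAME_LEN)
                  for n, x in enumerate(frame))
        mags.append(abs(acc))
    return mags
```

Applying `frame_magnitudes` to 200 successive frames of a word and stacking the results as columns would give a 64 x 200 array of the kind described above. The grouping of these bands into 16 one-third-octave bands is not shown because the band edges are not given in this passage.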

Comparison  of  the  intelligibility  results  from  the  human  listener  tests 
with  the  computer  processing  method  outlined  above  gave  a Pearson's  Correlation 
Coefficient  value  of  0.74  which  indicates  the  computer  prediction  accounted  for 
55%  of  the  variance  in  the  listener  error  scores. 